(54) Method and apparatus for compressing data



(57) A tag information separating unit separates a tag identified in a character train stream and outputs it
as tag information. A tag code replacing unit places an identifying tag code at the position in the character
train stream from which the tag was separated. A character train coding unit compression-encodes the character
train stream, including the tag codes, that is output from the tag code replacing unit. A data reconstructing
apparatus reconstructs the character train stream by performing the reverse of the compression operation.
Description

BACKGROUND OF THE INVENTION 
Field of the invention 

[0001] The present invention relates to a data compressing apparatus, a data reconstructing apparatus, and methods
therefor, for forming code data from a character train stream constituted by a structured document including tags.
More particularly, the invention relates to a data compressing apparatus, a data reconstructing apparatus, and methods
therefor, for separating tag information from a character train stream of a structured document and performing coding
and reconstruction.

Description of the Related Arts 

[0002] In recent years, various kinds of data such as character codes and image data have come to be handled by
computers. Further, with the spread of the Internet and intranets, the numbers of e-mail messages and electronic
documents are increasing. By compressing such large amounts of data, that is, by omitting redundant portions of the
data, the required storage capacity can be reduced and the compressed data can be sent to a remote place in a short time.
[0003] The field of the invention is not limited to the compression of character codes; the invention can be applied
to various kinds of data. In the following description, the terminology of information theory is adopted: one word unit
of data is called a character, and data in which an arbitrary number of words are connected is called a character train.
[0004] Recently, there has been a trend toward unifying the formats of documents handled on computers. As part of
this trend, to form documents efficiently, a method has been tried in which the contents of a document are partially
distinguished by using tags, a plurality of document parts such as titles and paragraphs are formed in advance, the
relations among the document parts are determined, and the document is structured and edited. Examples of structured
documents, in which the concept of structure is incorporated into the document, are those conforming to the
international standards ODA (ISO 8613: Open Document Architecture) and SGML (ISO 8879: Standard Generalized Markup
Language). A document processing method using such a structured document is disclosed in, for example,
JP-A-5-135054. The structured document according to SGML is highly compatible with conventional text processing
systems; it has spread mainly from the U.S.A. and has been put into practical use. In the structured document
according to SGML, a template of the document structure is given in advance and the document structure is
constrained by the template.

[0005] Fig. 1 shows an SGML structured document made up of three portions: an SGML declaration 200, a document
type definition (DTD) 202, and a document realization value 204. The template which defines the structure of the
document is the document type definition 202. As shown in Fig. 2, it defines the document structure such as chapters,
paragraphs, titles, and the like. In the structured document of SGML, in order to express the document structure, the
document text is divided by using identifiers called tags in the document text.
[0006] Fig. 3 shows a specific example of the structured document of SGML. For example, the title of a
document is expressed by "<TITLE> Specification of the Invention (Device) </TITLE>". That is, the characters
sandwiched between the start tag "<TITLE>" and the end tag "</TITLE>" are an element. In this case, the characters
show the title contents "Specification of the Invention (Device)". At present, the use of SGML is increasing,
mainly in public organizations. In the U.S.A. in particular, the Department of Defense requires documents to be
submitted in SGML. In Japan as well, such a structured document is adopted for the CD-ROM Official Gazette of the
Patent Office. HTML (Hyper Text Markup Language), which has spread as the description format of the WWW (World Wide
Web) used on the Internet, is one form of SGML.

[0007] As a method of compressing a structured document of SGML or the like, the applicant of the present
invention has proposed the method disclosed in Japanese Patent Application Laid-Open No. (JP-A) 9-261072. According
to the method, when document data of a structured document having tag information is inputted, the tag information
defined by the document type definition DTD or the like is detected. When tag information is detected, it is
outputted as it is without conversion. Once the tag information has been detected, the operating mode shifts to a
mode for coding the input character train other than the tag information.

[0008] The basic coding algorithm is shown in Fig. 4. First, in step S1, the input character or character train is
compared with the characters or character trains registered in advance in a dictionary to determine whether an
identical entry exists. If YES, the input data is encoded by the registration number of the dictionary entry in step
S2, and the code is outputted in step S3. When no identical registered character or character train can be retrieved
in step S1, the original input character or character train is outputted as it is in step S5. The above processes are
repeated until no input character train remains in step S4. When the SGML document file of Fig. 3 is subjected to the
encoding of Fig. 4, the compression data file of Fig. 5 is obtained. The compression data file has a form in which
uncompressed portions of tag information and portions of compressed document text are mixed in a single file.
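
The flow of Fig. 4 and the resulting mixed file can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionary contents, the regular expression used to recognize tags, and the choice of a longest-match search in step S1 are all hypothetical and are not taken from the cited application.

```python
import re

# Hypothetical toy dictionary: registered character train -> registration number.
DICTIONARY = {"data": 1, "compress": 2, "ion": 3}
# Assumed tag syntax; the text takes the real rule from the DTD.
TAG_PATTERN = re.compile(r"(</?[A-Za-z][^>]*>)")


def encode_mixed(document):
    """Return a list that mixes raw tags, registration numbers, and literals."""
    output = []
    max_len = max(len(entry) for entry in DICTIONARY)
    for piece in TAG_PATTERN.split(document):
        if not piece:
            continue
        if TAG_PATTERN.fullmatch(piece):
            output.append(piece)  # tag information is outputted as it is
            continue
        pos = 0
        while pos < len(piece):  # step S4: repeat until no input remains
            # Step S1: look for a registered train starting at pos
            # (longest candidate first, which is an assumption of this sketch).
            for length in range(min(max_len, len(piece) - pos), 0, -1):
                code = DICTIONARY.get(piece[pos:pos + length])
                if code is not None:
                    output.append(code)  # steps S2 and S3: encode and output
                    pos += length
                    break
            else:
                output.append(piece[pos])  # step S5: output the input as it is
                pos += 1
    return output
```

With this toy dictionary, `encode_mixed("<TITLE>data compression</TITLE>")` yields a list in which the two raw tags surround the codes for the text, which is the mixed layout described for Fig. 5.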






EP0 991 018 A2 



[0009] Since this method can compress a document text having an enormous data amount to a data amount which can be
used in practice, it is a very useful technique for realizing electronic document text. In the compression data file
of the structured document as shown in Fig. 5, however, the tag information exists as uncompressed portions mixed into
the compressed document data. To retrieve tag information in the file, therefore, the whole file has to be developed
into a memory before the necessary tag information can be retrieved. Even when the user wants to retrieve a keyword in
the text, which is a compressed portion, it is similarly necessary to develop the whole file into the memory and
process it. Consequently, in order to retrieve or obtain the necessary document from the compression data file of the
structured document, unnecessary portions of the document must also be read. As a result, the amount of data to be
transmitted increases, it takes time to read the data, and a large memory area and a large disk capacity need to be
secured.

SUMMARY OF THE INVENTION 

[0010] According to the invention, there is provided a data compressing apparatus which shortens the time needed to
retrieve or read a document and minimizes the increase in memory or disk capacity required for compression data of
a structured document including tag information.

[0011] A target of the invention is a data compressing apparatus for forming code data from a character train stream
constituted by a document including tags. According to the invention, the data compressing apparatus comprises: a
tag information separating unit for separating an identified tag from a character train stream and outputting it as tag
information; a tag code replacing unit for arranging a tag code for identification at the position in the character
train stream from which the tag was separated by the tag information separating unit; and a character train coding
unit for encoding the character train stream including the tag code outputted from the tag code replacing unit and
outputting a code stream. According to the data compressing apparatus of the invention, the tag information and the
text (character train) in the character train stream of the structured document including the tags are separated and
at least the text is encoded, thereby realizing a high compression ratio.

[0012] Searching the separated tag information allows retrieval to be performed at a high speed. For example, the
tag information separated from the text in the compression data file is searched and, when matching tag information
is found, the reconstructed text is skipped up to the tag code which corresponds to that tag information, thereby
enabling the laser beam to easily reach the head of the target document.

[0013] The tag code replacing unit arranges a predetermined fixed code as the tag code at the position in the
character train stream from which the tag was separated. By using a fixed code as the tag code, the tag positions in
the text can be easily retrieved. Alternatively, the tag code replacing unit arranges a tag code indicative of the
appearing order of the tags separated by the tag information separating unit at the position in the character train
stream from which the tag was separated. By giving the information of the appearing order to the tag code, the
retrieval of the text based on the tag information can be performed at a high speed and the reliability can be
enhanced. The data compressing apparatus further comprises: a tag information storing unit for storing the tag
information separated by the tag information separating unit; a code storing unit for storing code data formed by the
character train coding unit; and a code switching unit for selecting between the tag information stored in the tag
information storing unit and the code data stored in the code storing unit and outputting the selected tag information
or code data. By individually storing the separated tag information and the code data of the text, the retrieval of
the compression data and the handling of transfer requests can be easily performed.

[0014] The character train coding unit comprises: a dictionary storing unit for storing a dictionary in which
character trains each serving as a processing unit upon compression have been registered; and a character train
comparing unit for comparing a partial character train in the character train stream from the tag code replacing unit
with the registered character trains in the dictionary storing unit to thereby detect a partial character train which
coincides with a registered character train, allocating a predetermined code to each detected partial character
train, and outputting it. The coding process by the character train coding unit is effective in the compression of
document data formed by character codes of a language whose words are not separated by spaces. Examples of such
languages are Japanese, Chinese, Hangul, and the like. Taking Japanese as an example, there is a study result of
Japan Electronic Dictionary Research Institute (EDR) Co., Ltd. regarding Japanese words (Yokoi, Kimura, Koizumi, and
Miyoshi, "Information structure of electronic dictionary at surface layer level", the papers of Information
Processing Society of Japan, Vol. 37, No. 3, pp. 333 - 344, 1996). In the study result, the morphemes constituting
Japanese are totaled by part of speech. When words are simply classified into part-of-speech classes and registered,
the number of registered items is equal to 136,486 and they can be expressed by codes of 17 bits (maximum 262,143).
Further, the number of characters in each of about 130,000 words constituting a Japanese word dictionary formed by
the Institute for New Generation Computer Technology (ICOT) was counted and a distribution of the words was obtained.
Consequently, it has been found that 70,000 words, more than half of all the registered words, each consist of two
characters and that the average number of characters is equal to 2.8 characters (44.8 bits). The dictionary storing
unit forms and stores a dictionary which is practical as a dictionary of Japanese and in which a character train code
of a fixed length of, for example, 17 bits is allocated to each of, for example, about 130,000 words. A registered
character train in the dictionary which coincides with a partial character train of the non-compression data is
retrieved, and the fixed length code of 17 bits is allocated and outputted as the character train code, thereby
enabling the data amount to be compressed substantially to 1/2 or less irrespective of the size of the document data.
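
The "1/2 or less" figure follows from the numbers quoted above, as the short calculation below shows. It assumes two-byte (16-bit) character codes, which is what the 44.8-bit figure for 2.8 characters implies; the variable names are illustrative only.

```python
BITS_PER_CHARACTER = 16    # assumed two-byte character code (2.8 chars = 44.8 bits)
AVERAGE_WORD_LENGTH = 2.8  # average characters per registered word, from the text
CODE_LENGTH_BITS = 17      # fixed-length character train code, from the text

# Average size of one word before coding, in bits.
average_word_bits = AVERAGE_WORD_LENGTH * BITS_PER_CHARACTER

# Size of the coded word relative to the uncoded word.
ratio = CODE_LENGTH_BITS / average_word_bits

print(f"average word: {average_word_bits:.1f} bits, coded/original = {ratio:.3f}")
```

On these figures an average word shrinks from 44.8 bits to 17 bits, a ratio of about 0.38, which is below one half.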
[0015] The data compressing apparatus of the invention has a tag information compressing unit for compressing the
tag information separated by the tag information separating unit. The tag information includes a single tag and a
combination of a tag and a character train. The tag information compressing unit compresses the tag information as a
whole without distinguishing between the tag and the character train. An algorithm such as LZ77, LZ78, arithmetic
coding, or the like is used to perform the compression. Alternatively, the data compressing apparatus of the invention
compresses the tag information by applying the same coding as that of the character train coding unit for the text to
a character train, in the tag information, of a language such as Japanese which is not separated by spaces. That is,
the data compressing apparatus of the invention is characterized by comprising: a tag dictionary storing unit for
storing a dictionary in which tag character trains in the tag information each serving as a processing unit upon
compression have been registered; and a tag character train comparing unit for comparing a partial character train in
a character train stream included in the tag information separated by the tag information separating unit with the
registered character trains in the tag dictionary storing unit to thereby detect a partial character train which
coincides with a registered character train, allocating a predetermined code to each detected partial character
train, and outputting it. By compressing the separated tag information as mentioned above, together with the
compression of the text by the character train coding unit, the whole document file can be compressed at a high
compression ratio.

[0016] The data compressing apparatus of the invention further has a tag position detecting unit for detecting a tag
position in the code data formed by the character train coding unit. Designation information of the tag position
detected by the tag position detecting unit is stored in the tag information storing unit together with the tag
information separated by the tag information separating unit. In this case, the tag position detecting unit detects
the code amount from the head of the document or from a specific tag and stores it together with the tag information
into the tag information storing unit. Since the data amount (the number of bytes) from the document head or from a
specific tag, indicative of the position of the corresponding tag code in the compressed text, is stored as position
designation information in the separated tag information, when the user retrieves a necessary tag from the tag
information, the position of the corresponding tag code in the compression data of the text can be immediately
specified and random access to the required text can be efficiently performed.
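
A minimal sketch of this position bookkeeping is given below. It is an illustration only: `encode_text` is a hypothetical stand-in for the character train coding unit (plain UTF-16 rather than dictionary coding), the segment representation is invented for the example, and offsets are counted in bytes from the head of the code data.

```python
TAG_CODE = b"\x00\x00"  # fixed tag code, following the "0x0000" example


def encode_text(text):
    """Hypothetical stand-in for the character train coding unit."""
    return text.encode("utf-16-be")


def compress_with_positions(segments):
    """segments: ("tag", str) or ("text", str) items in document order.

    Returns (code_data, tag_table); tag_table holds pairs of
    (tag information, byte offset of its tag code in code_data).
    """
    code_data = bytearray()
    tag_table = []
    for kind, value in segments:
        if kind == "tag":
            # Record the code amount from the document head, then place the tag code.
            tag_table.append((value, len(code_data)))
            code_data += TAG_CODE
        else:
            code_data += encode_text(value)
    return bytes(code_data), tag_table


def find_tag_offset(tag_table, wanted):
    """Search only the tag information; return the offset of the matching tag code."""
    for tag_information, offset in tag_table:
        if tag_information == wanted:
            return offset
    return None
```

Because the lookup touches only the tag table, the code data can be read starting at the returned offset without first expanding the whole file.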

[0017] According to the invention, there is provided a data reconstructing apparatus for reconstructing character train
data from a code stream including tag information separated from a character train stream of a document including 
tags and code data obtained by encoding a character train stream in which a tag code has been arranged at a position 
of a separated tag. 

[0018] The data reconstructing apparatus is characterized by comprising: a tag information separating unit for
separating tag information and code data from a code stream; a tag information storing unit for storing the tag
information separated by the tag information separating unit; and a character train reconstructing unit for
reconstructing a character train and a tag code from the code data and, after that, replacing the tag code by the tag
information in the tag information storing unit. The character train reconstructing unit executes the operation
opposite to that of the character train coding unit and comprises: a dictionary storing unit for storing a dictionary
in which a reconstruction character train corresponding to a code of a character train serving as a processing unit
upon reconstruction has been registered; a character train comparing unit for separating the code of the character
train as a reconstruction unit from the code stream and reconstructing the original character train by referring to
the dictionary storing unit; and a character train replacing unit for replacing the tag code reconstructed by the
character train comparing unit by the tag information in the tag information storing unit. If the tag information was
compressed by LZ77, LZ78, or the like on the data compressing apparatus side, the data reconstructing apparatus of the
invention has a tag information reconstructing unit for reconstructing the compression data of the tag information
stored in the tag information storing unit. If the character train of the tag information was encoded on the data
compressing apparatus side, the data reconstructing apparatus of the invention comprises: a tag dictionary storing
unit for storing a dictionary in which a reconstruction character train corresponding to a code of a tag character
train serving as a processing unit upon reconstruction has been registered; and a tag character train comparing unit
for separating the code of the tag character train as a reconstruction unit from the tag information separated by the
tag information separating unit and reconstructing the original tag character train by referring to the tag dictionary
storing unit. The invention further provides a compressing method and a reconstructing method for a structured
document including tag information. A data compressing method of forming code data from a character train stream
constituted by a document including tags according to the invention comprises:

a tag information separating step of separating a tag identified from a character train stream and outputting it as
tag information;

a tag code replacing step of arranging a tag code for identification at a position in the character train stream from
which the tag was separated in the tag information separating step; and

a character train coding step of coding the character train stream including the tag code outputted in the tag code
replacing step and outputting the code stream.

[0019] According to the invention, there is provided a data reconstructing method of reconstructing character train
data from a code stream including tag information separated from a character train stream of a document including
tags and code data obtained by coding the character train stream in which a tag code has been arranged at the position
of each separated tag. The reconstructing method comprises:

a tag information separating step of separating tag information and code data;

a tag information storing step of storing the tag information separated in the tag information separating step; and

a character train reconstructing step of reconstructing the character train and the tag code from the code data and,
after that, replacing the tag code by the tag information stored in the tag information storing step.

The details of the data compressing method and the reconstructing method are the same as those in the case of the
apparatus.
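
The reconstructing steps can be illustrated with the following sketch. It assumes, purely for illustration, that the code data is a sequence of integer codes, that code 0 plays the role of the fixed tag code, and that `RECONSTRUCTION_DICTIONARY` is a toy dictionary; none of these specifics come from the text.

```python
TAG_CODE = 0x0000  # assumed fixed tag code
# Hypothetical reconstruction dictionary: code -> reconstruction character train.
RECONSTRUCTION_DICTIONARY = {1: "data ", 2: "compression"}


def reconstruct(code_data, tag_information):
    """Rebuild the character train, restoring stored tags in order of appearance."""
    tags = iter(tag_information)  # contents of the tag information storing step
    pieces = []
    for code in code_data:
        if code == TAG_CODE:
            # Replace the tag code by the next stored tag information.
            pieces.append(next(tags))
        else:
            # Reconstruct the original character train from its code.
            pieces.append(RECONSTRUCTION_DICTIONARY[code])
    return "".join(pieces)
```

For example, `reconstruct([0, 1, 2, 0], ["<TITLE>", "</TITLE>"])` returns `"<TITLE>data compression</TITLE>"`.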

[0020] The above and other objects, features, and advantages of the present invention will become more apparent
from the following detailed description with reference to the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0021]

Fig. 1 is an explanatory diagram of a structure of an SGML document;
Fig. 2 is an explanatory diagram of a specific example of a document type definition DTD of the SGML document;
Fig. 3 is an explanatory diagram of an SGML document file with respect to a Japanese document as an example;
Fig. 4 is a flowchart for a fundamental encoding algorithm to compress an SGML document file;
Fig. 5 is an explanatory diagram of an SGML document compression data file in which the portions of non-compressed
tag information and the portion of a compressed text mixedly exist;
Fig. 6 is a block diagram of the first embodiment of a data compressing apparatus according to the invention;
Fig. 7 is a block diagram of a tag information separating unit in Fig. 6;
Fig. 8 is an explanatory diagram of a processing procedure of the data compressing apparatus in Fig. 6;
Fig. 9 is an explanatory diagram of a text file in which tags in Fig. 8 are replaced by tag codes;
Fig. 10 is an explanatory diagram of a tag information file separated from a character train stream in Fig. 8;
Fig. 11 is an explanatory diagram of a text file in which the tags in Fig. 8 are replaced by tag codes with an
appearing order;
Fig. 12 is a flowchart for a compressing process of the data compressing apparatus in Fig. 6;
Fig. 13 is an explanatory diagram of a research result for a Japanese document;
Fig. 14 is an explanatory diagram of a dictionary structure of a dictionary storing unit in Fig. 6;
Figs. 15A and 15B are flowcharts for an encoding process in Fig. 6 using the dictionary structure in Fig. 14;
Fig. 16 is a block diagram of the first embodiment of a data reconstructing apparatus of the invention for
reconstructing a code stream from the data compressing apparatus in Fig. 6;
Fig. 17 is an explanatory diagram of a dictionary structure of a dictionary storing unit in Fig. 16;
Fig. 18 is a flowchart for a reconstructing process of the data reconstructing apparatus in Fig. 16;
Fig. 19 is a block diagram of the second embodiment of a data compressing apparatus of the invention;
Fig. 20 is a flowchart for a compressing process of the data compressing apparatus in Fig. 19;
Fig. 21 is a block diagram of the third embodiment of a data compressing apparatus of the invention;
Fig. 22 is an explanatory diagram of a processing procedure of the data compressing apparatus in Fig. 21;
Fig. 23 is a block diagram of the second embodiment of a data reconstructing apparatus of the invention for
reconstructing a code stream from the data compressing apparatus in Fig. 21;
Fig. 24 is a block diagram of the fourth embodiment of the data compressing apparatus of the invention;
Fig. 25 is an explanatory diagram of a processing procedure of the data compressing apparatus in Fig. 24;
Fig. 26 is a flowchart for a data compressing process in Fig. 24;
Fig. 27 is an explanatory diagram of a tag information file and a tag information stream which are stored in the
data compressing apparatus in Fig. 24 in which a code amount in Fig. 25 has been added to tags; and
Fig. 28 is a block diagram of the third embodiment of a data reconstructing apparatus of the invention for
reconstructing a code stream from the data compressing apparatus in Fig. 24.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

[0022] Fig. 6 is a block diagram of the first embodiment of a data compressing apparatus of the invention. The data
compressing apparatus is constructed by a tag information separating unit 10, a tag code replacing unit 12, and a
character train coding unit 14. The character train coding unit 14 has a character train comparing unit 16 and a
dictionary storing unit 18. The tag information separating unit 10 inputs a character train stream 20 read out from,
for example, the SGML Japanese document file shown in Fig. 3, discriminates tags included in the inputted character
train stream 20, separates the discriminated tags, and outputs them as a tag information stream 28. The tag code
replacing unit 12 arranges a predetermined tag code at each tag position of the character train stream from which tag
information has been separated by the tag information separating unit 10, and supplies a character train stream 22 in
which the tag codes have been arranged to the character train coding unit 14. The character train coding unit 14
encodes the character train stream 22 including the tag codes arranged by the tag code replacing unit 12 and outputs a
code stream 26.

[0023] Fig. 7 shows the details of the tag information separating unit 10 in Fig. 6 together with the tag code
replacing unit 12. The tag information separating unit 10 is constructed by a tag comparing unit 30, a tag
identification rule storing unit 32, and an output switching unit 34. An identification rule for the tag information,
obtained from the document type definition DTD of the SGML document, has been stored in the tag identification rule
storing unit 32. The tag comparing unit 30 inputs the character train stream 20 and compares it with the
identification rule in the tag identification rule storing unit 32. When the comparison identifies tag information,
the output switching unit 34 is switched from the output of the character train stream 22 to the output of the tag
information stream 28, and outputs the identified tag information as the tag information stream 28. At the same time,
the comparison result based on the tag information identification is outputted to the tag code replacing unit 12. A
tag code 24 which has been preset in the tag code replacing unit 12 is inserted at the position of the tag information
whose output from the output switching unit 34 has been stopped. For example, a hexadecimal fixed code "0x0000" is
used as the tag code 24 arranged at the position of the tag information in the character train stream 22 by the tag
code replacing unit 12.
[0024] Fig. 8 is an explanatory diagram of the compressing process performed by the data compressing apparatus in
Fig. 6 on the character train stream 20 read out from the SGML Japanese document file. An SGML Japanese document file
35, which is inputted as the character train stream 20 to the tag information separating unit 10, is compared with the
tag identification rule stored in the tag identification rule storing unit 32 by the tag comparing unit 30 provided
in the tag information separating unit 10 in Fig. 7. For example, the head "<TITLE> Specification of the Invention
(Device) </TITLE>" is identified as tag information. This tag information is separated and placed at the head position
of a tag information file 36. In parallel with the separation of the tag information, a tag code using the hexadecimal
fixed code "0x0000" is inserted at the position in the SGML Japanese document file 35 from which the tag information
has been separated. A character train stream of a tag-replaced Japanese document file 38 is formed by replacing the
tag information by the tag code. The tag information stream serving as the contents of the separated tag information
file 36 is outputted as it is. The character train stream serving as the contents of the tag-replaced Japanese
document file 38 is encoded by the character train coding unit 14 and outputted as the code stream 26.
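
The separation and replacement described for Figs. 7 and 8 can be sketched as below. This is a simplified illustration under stated assumptions: `TAG_RULE` is a hypothetical regular expression standing in for the identification rule derived from the DTD, a NUL character stands in for the fixed code "0x0000", and each tag is separated individually, whereas in the example of Fig. 8 the tag information is separated together with its enclosed character train.

```python
import re

TAG_CODE = "\u0000"  # in-memory stand-in for the fixed code "0x0000"
# Assumed identification rule; the text obtains the real rule from the DTD.
TAG_RULE = re.compile(r"<[^<>]+>")


def separate_tags(character_train):
    """Return (tag_information, tag_replaced_train) for one input character train."""
    # Tag information stream: every identified tag, in order of appearance.
    tag_information = TAG_RULE.findall(character_train)
    # Tag-replaced stream: the fixed tag code sits where each tag was separated.
    tag_replaced = TAG_RULE.sub(lambda match: TAG_CODE, character_train)
    return tag_information, tag_replaced
```

The first result corresponds to the contents of the tag information file 36 and the second to the tag-replaced document file 38, which would then be passed to the character train coding unit.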

[0025] Fig. 9 shows the tag-replaced Japanese document file 38 obtained by inputting the character train stream 20 
of the SGML Japanese document file in Fig. 3 to the data compressing apparatus in Fig. 6 and replacing the tag 
information by the fixed tag code by the tag code replacing unit 12. In the tag-replaced Japanese document file, the 
tag information in the SGML Japanese document file in Fig. 3 has been replaced by "(tag code)", respectively. 
[0026] Fig. 10 shows the tag information file 36 of the tag information separated from the character train stream of 
the SGML Japanese document file shown in Fig. 3. The tag information included in the inputted character train stream 
is sequentially separated and stored in the tag information file 36. The tag-replaced character train stream 22 serving 
as contents of the tag-replaced Japanese document file 38 in Fig. 9 is encoded by the character train coding unit 14 
in Fig. 6 and outputted as a compressed code stream 26. 

[0027] Fig. 11 shows the tag-replaced Japanese document file 38 when order tag codes showing the appearing order
of the tag information are used as tag codes. As order tag codes showing the appearing order of the tag information,
it is sufficient to use, for example, hexadecimal order tag codes such as "0x001, 0x002, 0x003, ..." which correspond
uniquely to the appearing order of the tags. When order tag codes indicative of the appearing order are used, as shown
in Fig. 11, the tag codes substituted into the Japanese character train data themselves indicate the appearing order
from the head of the document, like "(tag code 1), (tag code 2), (tag code 3), ...". Therefore, when the position of
the corresponding tag code in the document file in Fig. 11 is to be specified by searching the tag information
separated as shown in Fig. 10, the search position in the text can be specified easily and reliably. For example, if
the user wants to know the position in the document file of the tag information "<SECTION> Scope of Claim </SECTION>"
at line 5 in Fig. 10, since the tag identification information appears at the fifth line from the head, the position
can be easily specified by searching for "(tag code 5)", whose appearing order is No. 5.
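
The lookup by appearing order can be sketched as follows. The representation is an assumption made for illustration: order tag codes are modeled as `("TAG", n)` tuples in a list of pieces, and `TAG_RULE` is a hypothetical identification rule rather than one derived from a DTD.

```python
import re

TAG_RULE = re.compile(r"<[^<>]+>")  # assumed identification rule


def replace_with_order_codes(character_train):
    """Return (tag_information, pieces); pieces mixes text and ("TAG", n) codes."""
    tag_information = []
    pieces = []
    last_end = 0
    for match in TAG_RULE.finditer(character_train):
        if match.start() > last_end:
            pieces.append(character_train[last_end:match.start()])
        tag_information.append(match.group())
        # The order tag code carries the appearing order, counted from 1.
        pieces.append(("TAG", len(tag_information)))
        last_end = match.end()
    if last_end < len(character_train):
        pieces.append(character_train[last_end:])
    return tag_information, pieces


def locate(tag_information, pieces, wanted):
    """Find a tag in the separated tag information, then its order code in the text."""
    order = tag_information.index(wanted) + 1  # line number in the tag file
    return pieces.index(("TAG", order))
```

A tag found on line n of the separated tag information is located in the text by searching for order tag code n, with no need to count fixed codes from the head of the document.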
[0028] Fig. 12 is a flowchart for the compressing process by the data compressing apparatus in Fig. 6. First, in step
S1, the tag information is separated from the character train stream 20 of the input document by the tag information
separating unit 10 and outputted. In step S2, the tag code for identification is inserted by the tag code replacing
unit 12 at the position where the tag existed in the character train stream 20 of the input document. In step S3, the
corresponding registration number in the dictionary storing unit 18 is allocated as a code to each character train in
the tag-replaced character train stream by the character train comparing unit 16 provided in the character train
coding unit 14, and the code stream 26 is outputted. The processes in steps S1 to S3 are repeated until the input of
the character train stream is finished in step S4.

[0029] The coding process of the tag-replaced character train stream 22 by the character train comparing unit 16
and the dictionary storing unit 18 provided in the character train coding unit 14 in Fig. 6 will now be described. The
character train comparing unit 16 performs encoding by allocating a predetermined character train code to each
character train constituting a word, with reference to the dictionary storing unit 18. First, Japanese document data
will be considered as an example of the document data to be compressed by the character train comparing unit 16. In
Japanese document data, one character is constituted by word data of two bytes and the words in the document are not
divided by spaces. The Japanese document data is inputted in units of the document to be compressed at one time, and a
document of a suitable size on the order of kilobytes to megabytes is inputted. The character train comparing unit 16
sequentially inputs the character trains of the Japanese document data from the head and detects whether or not they
coincide with the registered character trains of word units which have previously been registered in the dictionary
storing unit 18. When a registered character train which coincides with the input character train is detected by the
character train comparing unit 16, the character train code which has previously been registered in the dictionary
storing unit 18 in correspondence with that registered character train is read out, allocated, and outputted.

[0030] The dictionary storing unit 18 to convert the character train of the Japanese document data into a character
train code on a word unit basis will now be described. Fig. 13 is a sum result regarding parts of speech of morphemes
constructing Japanese published by Japan Electronic Dictionary Research Institute (EDR) Co., Ltd. as a study result.
According to the sum result, the number of morphemes corresponding to the number of words is equal to 136,488.
When the number of words is expressed by binary numbers, they can be expressed by codes of 17 bits where the
maximum number of expression items is equal to 262,143. On the other hand, as a result of obtaining a distribution
by detecting the number of characters constructing the words from the Japanese dictionary having about 130,000
words formed by Institute for New Generation Computer Technology (ICOT), each of 70,000 words which are equal to
or larger than 1/2 of all of the registered words is constructed by two characters and the average number of characters
is equal to 2.8 characters. When the average number of characters (2.8 characters) is expressed by the number of
bits, it is equal to

$$2.8\ \text{characters} \times 2\ \text{bytes} = 5.6\ \text{bytes} \times 8\ \text{bits} = 44.8\ \text{bits}$$

[0031] According to the invention, by executing a coding such that a character train code of 17 bits expressing each
of the 136,486 words in Fig. 13 is preliminarily allocated and the character train of the inputted Japanese data is
converted to the character train code of 17 bits on a word unit basis, the data amount can be substantially compressed
to the half or less.
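
The arithmetic behind this estimate can be checked with a few lines of Python. The constants are the figures quoted in the text; the variable names are illustrative only, and the ratio is an average-case estimate rather than a measured result.

```python
AVG_CHARS_PER_WORD = 2.8   # average word length quoted for the ICOT dictionary
BYTES_PER_CHAR = 2         # one Japanese character is two-byte data
CODE_BITS = 17             # length of one character train code

# Average number of bits occupied by one uncompressed word.
avg_word_bits = AVG_CHARS_PER_WORD * BYTES_PER_CHAR * 8

# Size of the coded word relative to the average uncompressed word.
ratio = CODE_BITS / avg_word_bits

print(f"average word: {avg_word_bits:.1f} bits, coded/original ratio: {ratio:.2f}")
```

A ratio below 0.5 corresponds to the "half or less" stated above.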

[0032] Fig. 14 shows an embodiment of a dictionary structure of the dictionary storing unit 18 in Fig. 6. The dictionary
stored in the dictionary storing unit 18 in Fig. 6 has a double-layer structure of a head character storing unit 40 and a
dependent character train storing unit 42. The head character storing unit 40 uses character codes of Japanese
characters "あ, い, う, え, お, ... (which are pronounced a, i, u, e, o, ... in the Roman alphabet)" as indices. Since the
Japanese character code is two-byte data, as character codes 44, 131,072 kinds of storing positions from "0x0000" to
"0xFFFF" as hexadecimal numbers are allocated. The character code 44 accesses the position of the corresponding
character code by using the head character read by the character train comparing unit 16 in Fig. 6. A head address 46 is
stored after the character code 44. When the head character "あ (a)" of the character code 44 is taken as an example, the
head address 46 designates a head address "A1" in the dependent character train storing unit 42 in which the dependent
character train subsequent to the head character "あ (a)" has been stored. Subsequently, the number of dependent
character trains (48) is provided. For example, in case of the head character "あ (a)", (N1 = 4) is stored as the number
of dependent character trains (48). In the dependent character train storing unit 42, the head position is designated by
the head address 46 stored in correspondence to the character code 44 of the head character in the head character
storing unit 40 and the dependent character trains are stored at the storing positions of the number designated by the
dependent character train storing unit 42 from the head position. For example, four storing positions when the number
of dependent character trains (48) is (N1 = 4) are designated as dependent character train storing regions as targets
from the address A1 of the head address 46 corresponding to the head character "あ (a)". In the dependent character
train storing unit 42, a length 50 of dependent character train from the head, a dependent character train 52, and a
character train code 54 which is expressed by 17 bits are stored. In the head address A1, for instance, a dependent
character train "い (i)" having a length of L1 and its character train code are stored. A dependent character train "う (u)"
having a length L2 is stored together with its character train code at the next storing position. In the third region, a
dependent character train "お (o)" having a length L3 is stored together with its character train code. In the fourth storing
region, a code "NULL" indicating that a dependent character train having a length L4 does not exist is stored and a
character train code indicative of the absence is stored. That is, the fourth storing region shows registration of the
character train code of only one head character. The characters of the head character codes 24 in Fig. 5 and of the
dependent character trains 32 are examples of Japanese characters each expressed by a 2-byte code and are expressed
by the Roman alphabet as "(a), (i), (u), (e), (o), (ka), ..., (an), (an), (an), ..., (wan), (wan)" and "(i), (u), (o), ...,
(ken), (nat), (chikara), (tate), (mae), ...".
[0033] The first to 136,486th character train codes of 17 bits have preliminarily been allocated as character train
codes 54 in the dependent character train storing unit 42 in Fig. 14 on the basis of the number of words, and the relation
between a character train code K and a position address X in case of storing as shown in Fig. 14 can be expressed
by the following equation.

$$K = (N \cdot X - A1) / M \qquad (1)$$

where,

X: position address in the dependent character train storing unit 42
N: number (1, 2, 3, ..., N) of the dependent character train in which the coincidence has been detected
A1: start address in the dependent character train storing unit
M: storage byte length in the dependent character train storing unit

[0034] Since the storage byte length (M) in the dependent character train storing unit 42 is equal to the total length
of the length 50 of dependent character train, dependent character train 52, and character train code 54, it can be
expressed by, for example, the following equation.

$$\begin{aligned}
\text{Storage byte length } M &= \text{length} + \text{character code train} + \text{character train code} \\
&= 3\ \text{bits} + 96\ \text{bits} + 17\ \text{bits} \\
&= 116\ \text{bits} \\
&= 15\ \text{bytes}
\end{aligned} \qquad (2)$$

[0035] A case of allocating 96 bits to the dependent character train 52 by setting the maximum number of characters 
of the dependent character train which can be stored to six characters is shown here as an example. It will be obviously 
understood that since the average number of characters of the dependent character train is equal to 2.8 characters, 
if the maximum number of characters is set to three characters (48 bits) or larger, a sufficient compressing effect can 
be obtained. In this case, the storage byte length (M) of one storing region in the dependent character train storing unit 
is equal to 12 bytes. When the character train code (K) of 17 bits which is calculated by the equation (1) is used, it is 
sufficient to calculate the storing position (address) X from the value of the character train code (K) by the following 
equation at the time of reconstruction.

$$X = M \cdot K + A1 \qquad (3)$$

where,

K: character train code
A1: start address in the dependent character train storing unit
M: storage byte length in the dependent character train storing unit on the reconstruction side

[0036] In the equation (3), since the start address A1 in the dependent character train storing unit 42 in the dictionary
which is used on the reconstruction side, that is, an offset and the storage byte length (M) of the dependent character 
train storing unit 42 have been determined as constants, by substituting the character train code (K) to be reconstructed 
into the equation (3), the dictionary position (position address) X in which the character train to be reconstructed has 
been stored can be unconditionally calculated. 
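
The address calculation of equation (3) and the byte length of equation (2) can be sketched as below. The function names and the sample start address are hypothetical. The helper `character_train_code` is simply the algebraic inverse of equation (3), included for the round trip; it is an assumption of this sketch and is not a restatement of equation (1).

```python
import math


def storage_byte_length(length_bits=3, train_bits=96, code_bits=17):
    """Equation (2): total field width of one storing region, rounded up to whole bytes."""
    return math.ceil((length_bits + train_bits + code_bits) / 8)


def position_address(k, m, a1):
    """Equation (3): X = M * K + A1."""
    return m * k + a1


def character_train_code(x, m, a1):
    """Inverse of equation (3), valid when x lies on a storing region boundary."""
    return (x - a1) // m


M = storage_byte_length()          # 116 bits rounded up to bytes
A1 = 0x1000                        # hypothetical start address (offset)
X = position_address(4, M, A1)     # storing region of character train code 4
print(M, hex(X), character_train_code(X, M, A1))
```

Because M and A1 are constants on the reconstruction side, the lookup needs no search: one multiplication and one addition locate the storing region.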

[0037] Figs. 15A and 15B are flowcharts for the encoding process by the character train comparing unit 16 in Fig. 6
by the dictionary storing unit 18 having the dictionary structure of Fig. 14. First, in step S1, a pointer is moved to a
position P of the head character of the character train read to the character train comparing unit 16. A table in the head
character storing unit 40 corresponding to the character code 44 in Fig. 14 shown by the character code at the head
character position P is referred to in step S2. With reference to the table in the head character storing unit 40, the head
address 46 and the number of dependent character trains (48) in the dependent character train storing unit 42 are
obtained in step S3. Subsequently, in step S4, length data L of the length 50 of dependent character train is obtained
from the head data in the head address in the dependent character train storing unit 42. In step S5, L characters based
on the length data L of the dependent character train are extracted from the head character position P. The extracted
L characters are compared with the registration character train of the dependent character train 52 in the dependent
character train storing unit 42, thereby discriminating whether they coincide or not. When the extracted L characters
coincide with the registered dependent character train, the processing routine advances to step S8, the next character
train code 54 is read out and is allocated to the coincidence detected character train by the character train comparing
unit 16, and the resultant character train is outputted. In step S9, the pointer at the head character position P is updated
to the position P where it is moved by only the number L of characters of the dependent character train. If a process
for non-compression data is not finished in step S12, the processing routine is again returned to step S2 and similar
processes are repeated with respect to the updated head character position P. On the other hand, when the extracted
characters do not coincide with the registration dependent character train in the dependent character train storing
unit 42 in step S5, a check is made to see whether the process to the number (N) of dependent character trains has
been finished or not. If it is not finished yet, the processing routine is returned to step S7. The length data L of the
dependent character train is obtained from the next storing region in the head address in the dependent character train
storing unit 42. The dependent character train of the L characters is extracted again from the head character position
P in step S5 and is compared with the registration dependent character train in the dependent character train storing
unit 42 to see whether they coincide or not. In a case where they do not coincide even when the comparing process
is performed with respect to all of the dependent character trains of the registration number (N) by repetition of steps
S5 to S7, the end of the number (N) of dependent character trains is discriminated in step S6. The processing routine
advances to step S10 and a non-registered code indicative of one character of the head character is transmitted. In
step S11, the pointer is updated to a next position where the head character position P has been moved only by the
number (L) of characters (L = 1). The processing routine is returned from step S12 to step S2 and the processes from
the next head character position P are repeated.
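
The loop of Figs. 15A and 15B can be approximated by the following sketch. It is a simplification under stated assumptions: the two-layer dictionary is modelled as a Python dict from head character to an ordered list of (dependent character train, code) pairs, an empty dependent train stands in for the "NULL" entry, the pointer is advanced past the head character and the matched dependent train together, and the entries and code values are invented for illustration.

```python
# Hypothetical toy dictionary: head character -> ordered (dependent train, code) pairs.
# An empty dependent train plays the role of the "NULL" entry (head character alone).
DICTIONARY = {
    "あ": [("い", 0), ("う", 1), ("お", 2), ("", 3)],
    "か": [("お", 4), ("", 5)],
}


def encode(text):
    """Scan the text from the head and emit one item per word or character."""
    output = []
    p = 0                                     # head character position P (step S1)
    while p < len(text):                      # repeat until the input ends (step S12)
        head = text[p]
        entries = DICTIONARY.get(head, [])    # steps S2-S3: look up the head character
        for dependent, code in entries:       # steps S4-S7: try each dependent train
            length = len(dependent)
            if text[p + 1 : p + 1 + length] == dependent:
                output.append(("code", code))  # step S8: allocate the character train code
                p += 1 + length                # step S9: move the pointer past the word
                break
        else:
            output.append(("raw", head))       # step S10: non-registered character
            p += 1                             # step S11: advance by one character
    return output


print(encode("あいかX"))
```

Entries are tried in stored order, so placing the "NULL" entry last, as in Fig. 14, makes the head-character-only code a fallback when no longer dependent train matches.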

[0038] Fig. 16 is a block diagram of the first embodiment of a data reconstructing apparatus for reconstructing a
character train stream from the code stream which is outputted from the data compressing apparatus in Fig. 6 and
constructed by the code stream 26 and tag information stream 28. The data reconstructing apparatus comprises a tag
information separating unit 60, a tag information storing unit 62, and a character train reconstructing unit 64. The
character train reconstructing unit 64 has a code train comparing unit 66, a dictionary storing unit 65, and a character
train replacing unit 68. The tag information separating unit 60 inputs a code stream 56 sent from the data compressing
apparatus side in Fig. 6 and separates it into the tag information and the code data. The tag information is stored into
the tag information storing unit 62. The code data is outputted as a code stream 56 to the character train reconstructing
unit 64. The character train reconstructing unit 64 reconstructs the character train and the tag code from the code data
in the code train comparing unit 66 by using the dictionary storing unit 65. After that, in the character train replacing
unit 68, the tag code is replaced by the tag information stored in the tag information storing unit 62 and a reconstructed
character train stream 70 is outputted.

[0039] Fig. 17 is a flowchart for the reconstructing process of the data reconstructing apparatus in Fig. 16. First, in
step S1, the tag information separating unit 60 separates the tag information from the code stream 56 corresponding
to the input document and stores it into the tag information storing unit 62. In step S2, the code train in the code stream
56 from which the tag information has been separated is compared and collated with the registration number in the
dictionary storing unit 65 and converted into the character or character train stored by the coincident registration number.
In step S3, the tag codes included in the reconstructed character train are sequentially replaced in accordance with
the storing order of the tag information stored in the tag information storing unit 62 and outputted as a reconstructed
character train stream 70. The processes in steps S1 to S3 are repeated until the input of the code stream 56 is finished
in step S4. With reference to the dictionary storing unit 65, the code train comparing unit 66 provided in the character
train reconstructing unit 64 in Fig. 16 reconstructs the original character train from the code train stream encoded by
the data compressing apparatus in Fig. 6.
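
Steps S2 and S3 of the reconstruction can be sketched as follows. The sketch assumes a fixed tag code (here the NUL character), a code stream already separated from the tag information, and a tiny reconstruction dictionary; `TAG_CODE`, `WORDS`, and `reconstruct` are illustrative names only.

```python
TAG_CODE = "\u0000"   # assumed fixed tag code marking each tag position

# Hypothetical reconstruction dictionary: code -> character train.
WORDS = {0: "Data", 1: " compression"}


def reconstruct(tag_info, code_stream):
    """Rebuild the character train stream from tag information and code data."""
    # Step S2: convert each code into the registered character train,
    # passing tag codes through unchanged.
    text = "".join(TAG_CODE if c == TAG_CODE else WORDS[c] for c in code_stream)
    # Step S3: replace the tag codes with the stored tag information in storing order.
    tags = iter(tag_info)
    return "".join(next(tags) if ch == TAG_CODE else ch for ch in text)


result = reconstruct(["<TITLE>", "</TITLE>"], [TAG_CODE, 0, 1, TAG_CODE])
print(result)
```

Because the tag information is stored in appearing order, a simple iterator is enough to pair each tag code with its tag.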

[0040] Fig. 18 shows a dictionary structure of the character train dictionary storing unit 65 in Fig. 16. In the character
train dictionary storing unit 65, a head character 72, a dependent character train length 74, and a dependent character
train 76 have been stored in accordance with the order of the character train code 54 of 17 bits in the dependent
character train storing unit 42 shown in the dictionary structure in Fig. 14. Therefore, in the code train comparing unit
66, since the storage byte length M of the dependent character train storing unit 42 which is used for reconstruction
has been known from

$$\begin{aligned}
\text{storage byte length } M &= \text{head character} + \text{length} + \text{character code train} \\
&= 16\ \text{bits} + 3\ \text{bits} + 96\ \text{bits} \\
&= 115\ \text{bits} \\
&= 15\ \text{bytes},
\end{aligned} \qquad (6)$$

the position address X corresponding to the character train code K can be calculated from the following equation

$$X = M \cdot K + A1 \qquad (7)$$

where,

K: character train code
A1: start address of character train storing position
M: storage byte length

[0041] By obtaining and referring to the position address X showing the dictionary storing position from the separated 
character train code K as mentioned above, the character train comprising a combination of the corresponding head
character and dependent character train can be reconstructed. 

[0042] By the data compressing apparatus of Fig. 6 and the data reconstructing apparatus of Fig. 16 as mentioned
above, the character train stream of the SGML Japanese document file shown in Fig. 3 is separated into the tag 
information as shown in Fig. 10 and the character train stream in which the tag information is replaced by the tag code 
as shown in Fig. 9. In the embodiment, by encoding the character train stream which has already been replaced to the 
tag code, the portion corresponding to the text of the document file can be converted into a compression file of a high 
compression ratio. The tag information separated as shown in Fig. 10 is retrieved by using a keyword and if the tag 
information which coincides with the keyword is obtained, to which number the appearing position of the tag information 
corresponds is detected. Thus, by retrieving the appearing position of the tag code included in the document file of the 
tag code-replaced text in Fig. 9, the reading operation by specifying the document position corresponding to the retrieval
result of the tag information or the like can be easily performed. 

[0043] Fig. 19 shows the second embodiment of a data compressing apparatus of the invention. The embodiment
is characterized by providing a tag information storing unit 78 and a code storing unit 80 in addition to the first
embodiment of Fig. 6. The tag information separated from the character train stream 20 by the tag information separating unit
10 is stored into the tag information storing unit 78. Thus, for example, the tag information file 36 as shown in Fig. 10
is stored into the tag information storing unit 78. The code storing unit 80 is provided in the character train coding unit
14. The code data formed by the coding process in Fig. 15 is stored into the code storing unit 80 with respect to the
tag-replaced character train stream 22 obtained by inserting the tag code at the position of the separated tag
information by the tag code replacing unit 12. Besides the tag information storing unit 78 and code storing unit 80, a code
switching unit 82 is provided at the output stage. The code switching unit 82, for example, sequentially selects the tag
information stored in the tag information storing unit 78 and the code data stored in the code storing unit 80 and outputs
them as a code train stream 84.

[0044] Fig. 20 is a flowchart for a compressing process of the data compressing apparatus of Fig. 19. In the
compressing process, in step S1, the tag information is separated from the character train stream 20 of the input document
by the tag information separating unit 10 and stored into the tag information storing unit 78. In step S2, a tag code for
identification is inserted to a position where the tag exists in the character train stream 20 by the tag code replacing
unit 12. In step S3, the character train of the character train stream 22 after completion of the replacement of the tag
code is inputted to the character train comparing unit 16 of the character train coding unit 14 and converted into the
corresponding registration number of the dictionary structure in the dictionary storing unit 18. The processes in steps
S1 to S3 as mentioned above are repeated until the input of the character train stream is finished in step S4. When
the input of the character train stream is finished, step S5 follows. The code streams encoded by converting into the
separated tag information and tag code are sequentially read out from, for example, the tag information storing unit
78 and code storing unit 80 and outputted as a code train stream 84. By inputting the code train stream 84 outputted
from the data compressing apparatus in Fig. 19 to the data reconstructing apparatus shown in Fig. 16, the character
train stream can be reconstructed.

[0045] Fig. 21 shows the third embodiment of a data compressing apparatus of the invention. The embodiment is
characterized in that the tag information separated from the character train stream is compressed. In the data
compressing apparatus, a tag information compressing unit 86 is newly provided between the tag information separating
unit 10 and tag information storing unit 78 in the second embodiment in Fig. 19. The tag information compressing unit
86 compresses the tag information inputted and separated from the character train stream 20 by the tag information
separating unit 10 as a character train stream as a target of the compression and stores it into the tag information
storing unit 78. As for the compressing process by the tag information compressing unit 86, a compression algorithm
such as LZ77, LZ78, arithmetic encoding, or the like is used since the tags and the Japanese character train are
included in the tag information and they are compressed in a lump. The tag information separating unit 10, tag code
replacing unit 12, and character train coding unit 14 are the same as those in the second embodiment of Fig. 19. Fig.
22 is an explanatory diagram of the compressing process by the data compressing apparatus of Fig. 21. The character
train stream 20 serving as contents of the SGML Japanese document file 35 is separated into the tag information
serving as contents of the tag information file 36 by the tag information separating unit 10. After the tag information
was compressed by the tag information compressing unit 86, it is outputted via the storage of the tag information storing
unit 78. A fixed tag code or an order tag code indicative of the appearing order is inserted and arranged by the tag
code replacing unit 12 to the position of the tag information separated from the character train stream 20 serving as
contents of the SGML Japanese document file 35. The character train stream 22 serving as contents of the tag-replaced
Japanese document file 38 is outputted to the character train coding unit 14. The code data compressed by the character
train encoding is outputted via the storage by the code storing unit 80.
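
Compressing the separated tag information in a lump can be illustrated with Python's standard `zlib` module. This is a stand-in chosen for convenience: zlib implements DEFLATE, which combines LZ77 with Huffman coding, and the specification does not name any particular library. The newline separator and the function names are assumptions of the sketch, which therefore requires that no tag contains a newline.

```python
import zlib


def compress_tag_information(tag_info):
    """Join the separated tags and compress them as one character train stream."""
    blob = "\n".join(tag_info).encode("utf-8")   # assumption: tags contain no newline
    return zlib.compress(blob)


def reconstruct_tag_information(data):
    """Inverse operation performed on the reconstruction side."""
    return zlib.decompress(data).decode("utf-8").split("\n")


tag_info = ["<SECTION>", "</SECTION>"] * 50      # repetitive tags compress well
compressed = compress_tag_information(tag_info)
restored = reconstruct_tag_information(compressed)
print(len("\n".join(tag_info).encode("utf-8")), "->", len(compressed), "bytes")
```

Tags tend to repeat heavily within a structured document, which is why a general-purpose dictionary coder suits the tag information even though the text itself is handled by the word dictionary.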

[0046] Fig. 23 shows the second embodiment of a data reconstructing apparatus of the invention for reconstructing
a character train stream from a code stream 90 outputted from the data compressing apparatus in Fig. 21. The data
reconstructing apparatus further has a compression tag storing unit 92 and a tag information reconstructing unit 94 in
addition to the first embodiment of Fig. 16. The tag information separating unit 60 separates the compression tag
information included in the code stream 90 which is inputted and stores the separated compression tag information
into the compression tag storing unit 92. The compression tag information stored in the compression tag storing unit
92 is reconstructed by the tag information reconstructing unit 94 and stored into the tag information storing unit 62.
The tag information reconstructing unit 94 executes a reconstruction algorithm corresponding to LZ77, LZ78, or
arithmetic decoding on the data compression side. The other construction is substantially the same as that in Fig. 19.
[0047] Fig. 24 shows the fourth embodiment of a data compressing apparatus of the invention. The embodiment is
characterized in that the Japanese character train in the separated tag information is compressed by encoding and,
further, position designation information indicative of the position of the replaced tag code in the text is added to the
separated tag information. In the fourth embodiment, the tag information separating unit 10, the tag code replacing
unit 12, the character train coding unit 14 having the character train comparing unit 16 and dictionary storing unit 18,
the tag information storing unit 78, and the code switching unit 82 are substantially the same as those in the second
embodiment of Fig. 19. Besides them, a tag character train comparing unit 97, a tag dictionary storing unit 96, and a
code amount measuring unit 98 are newly provided. In the tag character train comparing unit 97 and tag dictionary
storing unit 96, the Japanese character train stream included in the tag information separated by the tag information
separating unit 10 is encoded by a coding algorithm similar to that in the character train coding unit 14, thereby
compressing the tag information. Therefore, a dictionary structure in the tag information storing unit 78 is the same as that
in Fig. 14 and the Japanese character train which is used in the tag information is used as a head character and
dependent characters. The coding process of the tag character train is performed in accordance with the flowcharts
of Figs. 15A and 15B. The code amount measuring unit 98 provided in the data compressing apparatus measures a
code amount in a range from the head of the character train stream to each replaced tag code with respect to the code
data due to the encoding with regard to the character train stream 22 of the text by the character train coding unit 14,
namely, the character train stream 22 in which the replacement of the tag code was finished as a target. The code
amount measuring unit 98 adds a measurement result of the code amount to each tag code as code position information
to each of the tag information separated from the character train stream to be stored into the tag information storing
unit 78 and stores it. As position designation information indicative of the position of the tag information replaced by
the tag code by the code amount measuring unit 98, besides the code amount from the head of the character train
stream, a code amount of the code data in a range from specific tag information in the character train stream to each
subsequent tag information can be used.

[0048] Fig. 25 is an explanatory diagram of a compressing process in the fourth embodiment of Fig. 24. The processes
such that the character train stream serving as contents of the SGML Japanese document file 35 is inputted, the tag
information file 36 by the separation of the tag information is formed, and the tag-replaced Japanese document file 38
in which the tag information was replaced by the tag code is formed are substantially the same as those in the second
embodiment of Fig. 19. Besides them, a tag character train as a Japanese character train included in the tag information
in the separated tag information file 36 is encoded and compressed by using the tag dictionary storing unit 96, thereby
outputting.

[0049] Fig. 26 shows a specific example of the tag information file stored in the tag information storing unit 78 and
relates to the tag information, as an example, separated from the SGML Japanese document file shown in Fig. 3. Code
amounts (byte amounts) DL1 to DL13 from the head of the code data of the character train data in the tag-replaced
Japanese file 38 in Fig. 25 have been stored as position designation information 106 on the right side in the tag
information file 36 in correspondence to each tag corresponding to indices 01 to 13 on the left side, respectively.
[0050] Fig. 27 is a flowchart for the compressing process according to the fourth embodiment of Fig. 24. First, steps
S1 to S4 are the same as those in Fig. 12. The tag information separated from the character train stream 20 by the
tag information separating unit 10 is stored into the tag information storing unit 78. The character train stream 22 in
which the tag code 24 has been inserted and arranged to the position of the tag information separated by the tag code
replacing unit 12 is encoded by the character train coding unit 14. The code data is stored into the code storing unit
80. In step S4, when the replaced tag code is encoded by the character train coding unit 14, the code amount measuring
unit 98 measures, for example, a code amount DL from the head of the character train stream. The measured code
amount DL is stored as position designation information 106 in Fig. 26 into the tag information already stored in the
tag information storing unit 78. The processes in steps S1 to S4 are repeated until the input of the character train
stream is finished in step S5. When the input of the character train stream 20 is finished, in step S6, the coding process
for converting the character train in the tag information separated and stored in the tag information storing unit 78 into
the corresponding block number of the dictionary in the tag dictionary storing unit 96 and using it as code data is
executed by the tag character train comparing unit 97. The resultant data is stored into the tag information storing unit
78. The contents stored in the tag information storing unit 78 are as shown in the compression tag information
file 36 in Fig. 26. In step S7, finally, the tag information with the code amount which was separated and encoded by
the tag information storing unit 78 and the code data stored in the code storing unit 80 are, for example, sequentially
selected and outputted by the code switching unit 82 and supplied as a code stream 100 to the outside. In the
compressing process in Fig. 27, the separation and replacement of the tag information in steps S1 to S4 and, further, the
measuring process of the amount of compressed codes and the subsequent coding process of the separated tag
information are time-divisionally performed. However, both of them can be processed in parallel.
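
The role of the code amount measuring unit 98 can be sketched as below. The representation is an assumption: the encoded stream is modelled as a list of (is tag code, code bytes) pairs in stream order, and DL is counted in bytes from the head of the code data up to, but not including, each tag code. `attach_code_amounts` is a hypothetical name.

```python
def attach_code_amounts(tag_info, coded_items):
    """Pair each separated tag with the code amount DL measured at its tag code.

    coded_items: (is_tag_code, code_bytes) pairs in stream order.
    """
    positions = []
    offset = 0                           # code amount emitted so far, in bytes
    for is_tag_code, code_bytes in coded_items:
        if is_tag_code:
            positions.append(offset)     # DL: code amount from the head of the stream
        offset += len(code_bytes)
    return list(zip(tag_info, positions))


coded = [(True, b"\x01"), (False, b"abc"), (True, b"\x01"), (False, b"de")]
tag_file = attach_code_amounts(["<A>", "</A>"], coded)
print(tag_file)
```

Measuring relative to a specific earlier tag instead of the head, which the fourth embodiment also allows, would only change where `offset` is reset.
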
[0051] Fig. 28 shows the third embodiment of a data reconstructing apparatus of the invention for reconstructing a
character train stream from the code stream 100 outputted from the data compressing apparatus in Fig. 24. In the
embodiment, the tag information separating unit 60, compression tag storing unit 92, tag information storing unit 62,
and character train reconstructing unit 64 are substantially the same as those in the second embodiment in Fig. 23.
Besides them, a tag character train reconstructing unit 102 and a tag reconstruction dictionary storing unit 104 are
newly provided. As the tag reconstruction dictionary storing unit 104, a unit having the same dictionary structure as
that in Fig. 17 is used, and the stored characters are the Japanese character trains which are used in the tags. The
tag information separating unit 60 separates the tag information stream as shown in the contents of the compression
tag information file 36 in Fig. 26 from the code stream 100 which is supplied from the data compressing apparatus side
in Fig. 24 and stores it into the compression tag storing unit 92. The compression tag information stored in the
compression tag storing unit 92 is reconstructed to the corresponding Japanese character train by the tag character
train reconstructing unit 102, which refers to the dictionary number in the tag reconstruction dictionary storing unit 104
by the code of the tag character train. The tag information including the reconstructed Japanese character train is
stored into the tag information storing unit 62. The tag information separating unit 60 supplies the code stream of the
document text that is sent after the compression tag information stream to the character train reconstructing unit 64.
In the code train comparing unit 66, the corresponding characters or character train is reconstructed with reference to
the dictionary number in the dictionary storing unit 65 by the extracted code and outputted to the character train replacing
unit 68. The character train replacing unit 68 recognizes the tag code in the reconstructed character train, sequentially
extracts the reconstructed tag information stored in the tag information storing unit 62 in accordance with the appearing
order, substitutes it for the tag code, and outputs the reconstructed character train stream. As shown in Fig. 26, the
compression tag information file 36 has been stored in the compression tag storing unit 92 at a time point when the
input of the compression tag information stream separated from the code stream 100 is finished. Therefore, the
compression tag information file 36 is searched by using a specific tag as a keyword. If the coincident tag is obtained, the
code amount DL as position designation information corresponding thereto is read out. It is possible to request the
data compressing apparatus of Fig. 24 to transfer the code data from the position of the retrieved code amount DL.
Thus, by transferring the partial compression text data of the SGML Japanese document which is necessary from the
data reconstructing side to the data compressing side, the data can be easily read.
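
The replacement performed by the character train replacing unit 68 and the keyword search for the code amount DL can be illustrated by the following minimal sketch. It is not part of the disclosed embodiments. The value of `TAG_CODE`, the function names, and the representation of the compression tag information file 36 as a list of (tag, DL) pairs are assumptions made only for illustration.

```python
TAG_CODE = "\x00"  # hypothetical fixed tag code; the specification does not give its value


def restore_tags(decoded_text, tag_info):
    """Substitute the stored tags for the tag codes, in order of appearance."""
    tags = iter(tag_info)
    out = []
    for ch in decoded_text:
        # Each tag code marks the position from which one tag was separated.
        out.append(next(tags) if ch == TAG_CODE else ch)
    return "".join(out)


def find_code_offset(tag_file, keyword):
    """Return the code amount DL stored for the first tag equal to keyword, or None."""
    for tag, dl in tag_file:
        if tag == keyword:
            return dl
    return None
```

For example, `find_code_offset([("<title>", 0), ("<sec>", 42)], "<sec>")` yields the stored offset 42, which the reconstructing side could then use when requesting a partial transfer.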

[0052] As mentioned above, according to the invention, with respect to the character train stream of the structured
document such as SGML or the like including the tags, a high compression ratio is realized by separating the tag
information and the text (character train) and encoding at least the text. By retrieving the separated tag information,
the reading and the retrieval of the specific tag position in the compressed code data can be processed at a high speed.
That is, the order of the separated tag information and that of the tag codes replaced in the code data correspond in
a one-to-one correspondence relation. By retrieving the specific tag information with respect to the tag information, the
position of the tag code in the code data can be specified by such orders. It is possible to easily reach the head position
of the target document code data. Thus, with respect to the structured document such as an SGML including the tags,
the compression and reconstruction can be performed at a high speed while maintaining a high compression ratio.
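
The one-to-one correspondence between the order of the separated tag information and the order of the tag codes can be illustrated by the following minimal sketch. It is not part of the disclosed embodiments. It assumes that the code data is available as a list of fixed-length integer codes and that the tag code has the hypothetical value 0; the function name is likewise an assumption.

```python
TAG_CODE = 0  # hypothetical code value reserved for the tag code


def locate_tag(tag_info, codes, wanted):
    """Return the position in the code data of the tag code that stands for `wanted`."""
    n = tag_info.index(wanted)  # order of the tag among the separated tags
    seen = 0
    for pos, code in enumerate(codes):
        if code == TAG_CODE:
            if seen == n:
                return pos  # the n-th tag code corresponds to the n-th separated tag
            seen += 1
    raise ValueError("code data holds fewer tag codes than separated tags")
```

Only the comparatively small tag information has to be searched as text; the code data itself is scanned for tag codes without being reconstructed.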
[0053] As a transmitting form from the data compressing apparatus to the data reconstructing apparatus in the
invention, a communication line such as Internet or the like or a proper form of a rewritable portable medium such as
optical disk cartridge, magnetic disk cartridge, or the like can be used. In the foregoing embodiments, as a compression
of the character train stream in which the tag information is separated and the tag code is replaced to the position of
the separated tag information, the encoding in which the character train code of a fixed length corresponding to the
number of words peculiar to Japanese is allocated is performed as an example. However, it will be obviously understood
that the compression by LZ77, LZ78, arithmetic encoding, or the like other than the above method can be performed.
Further, the invention is not limited by the numerical values in the foregoing embodiments. Further, the invention
incorporates many modifications and variations within the purview of the invention without departing from the objects
and advantages thereof.
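
As an illustration of combining the tag separation with a coder other than the fixed-length dictionary coding of the embodiments, the following minimal sketch uses the DEFLATE coder of the Python `zlib` module (an LZ77-based method with Huffman coding) as a stand-in. It is not part of the disclosed embodiments. The recognition of tags by the pattern `<[^>]*>`, the value of `TAG_CODE`, and the function names are assumptions, and the input document is assumed not to contain the tag code character itself.

```python
import re
import zlib

TAG_CODE = "\x00"                 # hypothetical fixed tag code
TAG_RE = re.compile(r"<[^>]*>")   # simplified tag pattern, assumed for illustration


def compress(document):
    """Separate the tags, put a tag code at each tag position, and code the text."""
    tags = TAG_RE.findall(document)
    text = TAG_RE.sub(TAG_CODE, document)
    return tags, zlib.compress(text.encode("utf-8"))


def reconstruct(tags, code_data):
    """Decode the text and substitute the tags for the tag codes in order."""
    text = zlib.decompress(code_data).decode("utf-8")
    parts = text.split(TAG_CODE)  # one more part than there are tag codes
    out = [parts[0]]
    for tag, part in zip(tags, parts[1:]):
        out.append(tag)
        out.append(part)
    return "".join(out)
```

Calling `reconstruct(*compress(document))` returns the original document as long as the stated assumptions hold.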



Claims

1. A data compressing apparatus for generating code data from a character train stream constructed by a document
including tags, comprising:

a tag information separating unit for separating the identified tag from said character train stream and outputting
as tag information;

a tag code replacing unit for arranging a tag code for identification to a position of the character train stream
in which the tag was separated by said tag information separating unit; and

a character train coding unit for coding the character train stream including the tag code outputted from said
tag code replacing unit and outputting a code stream.

2. An apparatus according to claim 1, wherein said tag code replacing unit arranges a predetermined fixed code as
said tag code to the position of the character train stream in which the tag was separated.

3. An apparatus according to claim 1, wherein said tag code replacing unit arranges a tag code indicative of an
appearing order of the tag separated by said tag information separating unit to the position of the character train
stream in which the tag was separated.

4. An apparatus according to claim 1, further comprising:

a tag information storing unit for storing the tag information separated by said tag information separating unit;

a code storing unit for storing the code data formed by said character train coding unit; and

a code switching unit for selecting the tag information stored in said tag information storing unit and the code
data stored in said code storing unit and outputting the selected tag information or code data.

5. An apparatus according to claim 1, wherein said character train coding unit comprises:

a dictionary storing unit for storing a dictionary in which a character train serving as a processing unit when 
compressing has been registered; and 

a character train comparing unit for comparing a partial character train in the character train stream from said 
tag code replacing unit with the registration character train in said dictionary storing unit, thereby detecting a 
partial character train which coincides with said registration character train, allocating a predetermined code 
every said detected partial character train, and outputting a resultant character train. 

6. An apparatus according to claim 1, further comprising a tag information compressing unit for compressing the tag
information separated by said tag information separating unit. 

7. An apparatus according to claim 1, further comprising:

a tag dictionary storing unit for storing a dictionary in which a tag character train in the tag information serving
as a processing unit when compressing has been registered; and

a tag character train comparing unit for comparing the partial character train of the character train stream
included in the tag information separated by said tag information separating unit with the registration character
train in said tag dictionary storing unit, thereby detecting a partial character train which coincides with said
registration character train, allocating a predetermined code every said detected partial character train, and
outputting a resultant character train.

8. An apparatus according to claim 4, further comprising a tag position detecting unit for detecting a position of the
tag in the code data formed by said character train coding unit,

and wherein both the tag information separated by said tag information separating unit and designation
information of the tag position detected by said tag position detecting unit are stored in said tag information storing
unit.

9. An apparatus according to claim 8, wherein said tag position detecting unit detects the code amount from the head
of a document or a specific tag and stores it together with the tag information into said tag information storing unit.

10. A data reconstructing apparatus for reconstructing character train data from a code stream including tag information
separated from a character train stream of a document including tags and code data obtained by encoding a
character train stream in which a tag code has been arranged at a position of the separated tag, comprising:

a tag information separating unit for separating said tag information and said code data from said code stream;

a tag information storing unit for storing the tag information separated by said tag information separating unit; and

a character train reconstructing unit for reconstructing the character train data including the character train
and the tag code from said code data and, thereafter, replacing said tag code by the tag information in said
tag information storing unit.

11. An apparatus according to claim 10, wherein said character train reconstructing unit comprises:

a dictionary storing unit for storing a dictionary in which a reconstruction character train corresponding to a
code of the character train serving as a processing unit when reconstructing has been registered;

a character train comparing unit for separating a code of the character train serving as a reconstruction unit
from said code stream and reconstructing the original character train with reference to said dictionary storing
unit; and

a character train replacing unit for replacing the tag code reconstructed by said character train comparing unit
by the tag information in said tag information storing unit.

12. An apparatus according to claim 10, further comprising a tag information reconstructing unit for reconstructing 
compression data of the tag information stored in said tag information storing unit. 

13. An apparatus according to claim 10, further comprising: 

a tag dictionary storing unit for storing a dictionary in which a reconstruction character train corresponding to
a code of a tag character train serving as a processing unit when reconstructing has been registered; and

a tag character train comparing unit for separating a code of the tag character train serving as a reconstruction
unit from the tag information separated by said tag information separating unit and reconstructing the original
tag character train with reference to said dictionary storing unit.

14. A data compressing method of generating code data from a character train stream constructed by a document
including tags, comprising:

a tag information separating step of separating the identified tag from said character train stream and outputting
as tag information;

a tag code replacing step of arranging a tag code for identification to a position of the character train stream
in which the tag was separated in said tag information separating step; and

a character train coding step of coding the character train stream including the tag code outputted from said
tag code replacing step and outputting a code stream.

15. Computer software for implementing the method of claim 14. 